Natural gradient

Background Story

Gradient descent is not efficient in variational inference, because probability distributions do not naturally live in Euclidean space but rather on a statistical manifold. There are better ways of defining the distance between distributions, one of the simplest being the symmetrized Kullback-Leibler divergence:

$$\mathrm{KL}_{\mathrm{sym}}(p_1, p_2) = \tfrac{1}{2}\big(\mathrm{KL}(p_1 \,\|\, p_2) + \mathrm{KL}(p_2 \,\|\, p_1)\big)$$
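As a quick illustration (my own sketch, not part of the original post), the symmetrized KL divergence between two discrete distributions can be computed directly from this definition; `p1` and `p2` below are arbitrary probability vectors with full support.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return np.sum(p * np.log(p / q))

def kl_sym(p1, p2):
    """Symmetrized KL: the average of the two directed divergences."""
    return 0.5 * (kl(p1, p2) + kl(p2, p1))

p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.3])
print(kl_sym(p1, p2))  # small positive number; zero iff p1 == p2
```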

In differential geometry, the (squared, infinitesimal) distance on a manifold is given by the bilinear form

$$\|d\phi\|^2 = \langle d\phi,\, G(\phi)\, d\phi\rangle = \sum_{i,j} g_{ij}(\phi)\, d\phi_i\, d\phi_j$$
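To make the formula concrete, here is a minimal sketch (illustrative, not from the post) that evaluates this squared length for a small displacement under an arbitrary positive-definite metric matrix; with the identity metric it reduces to the ordinary Euclidean squared norm.

```python
import numpy as np

def squared_length(dphi, G):
    """<dphi, G dphi> = sum_ij g_ij dphi_i dphi_j."""
    return dphi @ G @ dphi

dphi = np.array([0.1, -0.2])
print(squared_length(dphi, np.eye(2)))   # Euclidean case: G is the identity

G = np.array([[2.0, 0.5],                # an arbitrary positive-definite metric
              [0.5, 1.0]])
print(squared_length(dphi, G))
```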

The matrix $G(\phi) = [g_{ij}(\phi)]$ is called the Riemannian metric tensor.
In Euclidean space with an orthonormal basis, $G(\phi)$ is simply the identity matrix. When $\Phi$ is a space of parameters of probability distributions and the symmetrized KL divergence is used to measure the distance between distributions, $G(\phi)$ turns out to be the Fisher information matrix:
$$\mathcal{I}(\theta)_{i,j} = \operatorname{E}\!\left[\left(\frac{\partial}{\partial\theta_i}\log f(X;\theta)\right)\left(\frac{\partial}{\partial\theta_j}\log f(X;\theta)\right)\,\middle|\,\theta\right].$$
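This expectation can be estimated by Monte Carlo as the average outer product of the score. The sketch below (an assumption for illustration, not code from the post) does this for a univariate Gaussian $N(\mu, \sigma^2)$ parameterized by $(\mu, \log\sigma)$, where the exact answer is $\mathrm{diag}(1/\sigma^2,\, 2)$.

```python
import numpy as np

def scores(x, mu, sigma):
    """Score of log N(x; mu, sigma^2) w.r.t. (mu, log sigma), one row per sample."""
    d_mu = (x - mu) / sigma**2
    d_log_sigma = (x - mu)**2 / sigma**2 - 1.0
    return np.stack([d_mu, d_log_sigma], axis=1)

rng = np.random.default_rng(0)
mu, sigma = 0.0, 2.0
x = rng.normal(mu, sigma, size=100_000)   # samples drawn from the model itself

S = scores(x, mu, sigma)
F = S.T @ S / len(x)                      # Monte Carlo estimate of the Fisher information
print(F)                                  # approximately diag(0.25, 2.0)
```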

The Story

In gradient ascent (of the evidence lower bound in variational inference), we want to maximize:

$$\mathcal{L}(\phi + d\phi) \approx \mathcal{L}(\phi) + \epsilon\, \nabla\mathcal{L}(\phi)^{T} v$$

where $d\phi = \epsilon v$, subject to the constraint $\|v\|^2 = \langle v,\, G(\phi)\, v\rangle = 1$. Solving with Lagrange multipliers gives $\nabla\mathcal{L}(\phi) = 2\lambda\, G(\phi)\, v$, so the optimal direction, the natural gradient, is obtained by multiplying the ordinary gradient by the inverse of the Fisher information matrix:
$$G(\phi)^{-1}\, \nabla\mathcal{L}(\phi)$$
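Here is a minimal sketch of one natural-gradient ascent step, under my own assumptions (the names `grad`, `F`, `step_size`, and the `jitter` term are illustrative, not from the post): rather than forming $G(\phi)^{-1}$ explicitly, it is cheaper and more stable to solve the linear system $G(\phi)\, u = \nabla\mathcal{L}(\phi)$.

```python
import numpy as np

def natural_gradient_step(phi, grad, F, step_size=0.1, jitter=1e-6):
    """One step of phi <- phi + step_size * G(phi)^{-1} grad L(phi).

    F is the Fisher information matrix at phi; the small jitter keeps the
    linear solve well-conditioned when F is nearly singular.
    """
    nat_grad = np.linalg.solve(F + jitter * np.eye(len(phi)), grad)
    return phi + step_size * nat_grad

# Toy usage with arbitrary numbers, purely for illustration.
phi = np.array([0.0, 1.0])
grad = np.array([1.0, -0.5])
F = np.array([[4.0, 0.0],
              [0.0, 2.0]])
print(natural_gradient_step(phi, grad, F))  # each axis of the step is rescaled by 1/F_ii
```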

Reference

The Natural Gradient: https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-gradient/